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SUMMARY 


Recently the GMRESR inner-outer iteration scheme for the solution of linear systems of equations has 
been proposed by Van der Vorst and Vuik. Similar methods have been proposed by Axelsson and 
Vassilevski [1] and Saad (FGMRES) [10]. The outer iteration is GCR, which minimizes the residual over a 
given set of direction vectors. The inner iteration is GMRES, which at each step computes a new direction 
vector by approximately solving the residual equation. However, the optimality of the approximation over 
the space of outer search directions is ignored in the inner GMRES iteration. This leads to suboptimal 
corrections to the solution in the outer iteration, as components of the outer iteration directions may 
reenter in the inner iteration process. Therefore we propose to preserve the orthogonality relations of GCR 
in the inner GMRES iteration. This gives optimal corrections; however, it involves working with a singular, 
non-symmetric operator. We will discuss some important properties and we will show by experiments that, 
in terms of matrix vector products, this modification (almost) always leads to better convergence. However, 
because we do more orthogonalizations, it does not always give an improved performance in CPU-time. 
Furthermore, we will discuss efficient implementations as well as the truncation possibilities of the outer 
GCR process. The experimental results indicate that for such methods it is advantageous to preserve the 
orthogonality in the inner iteration. Of course we can also use other iteration schemes than GMRES as the 
inner method. Especially methods with short recurrences like BICGSTAB seem of interest. 


INTRODUCTION 


For the solution of systems of linear equations the so-called Krylov subspace methods are very popular. 
However, for general matrices no Krylov method can satisfy a global optimality requirement and have short 
recurrences [5]. Therefore either restarted or truncated versions of optimal methods, like GMRES [11], are 
used or methods with short recurrences, which do not satisfy a global optimality requirement, like BiCG 
[6], BICGSTAB [14], BICGSTAB(Z) [12], CGS [13] or QMR [8]. Recently Van der Vorst and Vuik proposed 
a class of methods, GMRESR [15], which are nested GMRES methods; see Fig. 2. The GMRESR 
algorithm is based upon the GCR algorithm [4]; see Fig. 1. For a given initial guess xo, they both compute 
approximate solutions x^, such that x^ — xo 6 span{u\j U 2 } . • » , u^} and ||r *.||2 = ||6 — Ax/d |2 is minimal. 

1 Delft University of Technology, Faculty of Technical Mathematics and Informatics, P.O. Box 5031, NL-2600 GA Delft, The 
Netherlands, E-mail: witaedsCdutinfh.tudelft.nl. The author wishes to acknowledge Shell Research B.Y. and STIPT for 

the. financial support of his research. 

2 Mathematical Institute, University of Utrecht, P.O. Box 80.010. NL-3508 TA Utrecht, The Netherlands, E-mail: 

fokkemaflmath.ruu.nl. This work was supported in part by a NOF/Cray Research University Grant CRG 92.03 


PnfiGi OHMG PAGE BLANK NOT FILMED 


PAGE 



INTENTIONALLY BLANK* 


111 



Ullj 


GCR: 

1. Select xo, m, tol; 

r 0 = 6 — Axq, k = 0; 

2. while (|rfc ||2 > tol do 

k = k + 1; 

u* = ffc-i; Cfc = Au k\ 
for * = 1, k — 1 do; 
= cjc k ; 

Ck = Cfc - aicn 


u k = u k - a,u i; 

Ck = c fc/|l c fc||i 

Uk = ufc/IMI; 

x k = Xk - i + (4V fc -i)«fc; 

Tk = T k - 1 - (Cfcr fc -i)c fc ; 


Figure 1: The GCR algorithm 


GMRESR: 

1 . Select xo, m, tol ; 

7-0 = 6 - Ax o, k = 0; 

2. while ||rjfc ||2 > tol do 

k — k + 1; 

u k = 1; c fc = 

for t = 1, . . . , k — 1 do 

a* = c k , 

Cfe = Ck 

u k = u k - aiUi ; 

Cfc = c fe/|| c fc||2j 

Ufc = Ufc/||c*;||2; 

Xfc = Xfc-1 + (c£r fe _i)u fc ; 
rfc =r fc _i - (cj[Vfc-i)cfc; 

V m ,k(A) indicates the GMRES polynomial that 
is implicitly constructed in m steps of GMRES 
when solving Ay = T k _\. 

Figure 2 : The GMRESR algorithm 


However, they compute different direction vectors u k . GCR sets u k simply to r k -i, while GMRESR 
computes u k by applying m steps of GMRES to r k -i (represented by V m ,k(^) T k - 1 in Fig. 2). The inner 
GMRES iteration computes a new search direction by approximately solving the residual equation and 
then the outer GCR iteration minimizes the residual over the new search direction and all previous search 
directions u». The algorithm can be explained as follows. 

Assume we are given the system of equations Ax = b, where A is a real, nonsingular, linear (n x n)-matrix 
and 6 is a n- vector. Let U k and C k be two (n x fc)-matrices for which 

C k = AU k , CZC k = I k , C 1 ) 

and let io be an initial guess. For x k - xo € range(U k ) the minimization problem 

||6-Az fc || 2 = min ||r 0 - Ax|| 2 - ( 2 ) 


is solved by 


Xfc = xq + U k C^ro 


(3) 


and r k = 6 — Ax* satisfies 


r k = r Q - C k C k ro, r k -L range(C k ) ■ 


(4) 


In fact we have constructed the inverse of the restriction of A to range(U k ) onto range(C k ) . This 
inverse is given by _ „ 

A-'C.Cj = V k C T k . (5) 

This principle underlies the GCR method. In GCR the matrices U k = [ui U'i - ■ ■ and C k [ci C 2 . . • c k ] 
are constructed such, that range(U k ) is equal to the Krylov subspace 

K k (A; rQ ) = span { r0) Ar 0 , . • ■ , A fc_1 r 0 } . Provided GCR does not break down; i.e. if c k / r*_i, it is a finite 
method and at step k it solves the minimization problem (2). 


i jfcSB'- 
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Consider the Jfc-th step in GCR. Equations (l)-(3) indicate that if in the update u k = r*_i (in GCR), we 
replace r k - 1 by any other vector, then the algorithm still solves (2); however, the subspace U k will be 
different. The optimal choice would be u k — e k -i 1 where e k -\ is the error in x k ~\. In order to find 
approximations to e^-i, we use the relation = r k ~ \ and any method which gives an approximate 

solution to this equation can be used to find acceptable choices for u k . In the GMRESR algorithm 
GMRES(m) is chosen to be the method to find such an approximation. 

However, since we already have an optimal such that Xfc_i — xo £ range(U k -\) , we need an 

approximation u k to such that c k = Au k is orthogonal to range(C k - 1 ) . Such an approximation is 

computed explicitly by the orthogonalization loop in the outer GCR iteration. Because in GMRESR this is 
not taken into account in the inner GMRES iteration, a less than optimal minimization problem is solved, 
leading to suboptimal corrections [2] to the residual. Another disadvantage of GMRESR is that the inner 
iteration is essentially a restarted GMRES. It therefore also displays some of the problems of restarted 
GMRES. Most notably it can have the tendency to stagnate (see Numerical Experiments). 

From this we infer that we should preserve the orthogonality of the correction to the residual also in the 
inner GMRES iteration. In order to do this we use A k _ l = (I - C^Cl^A as the operator in the inner 
iteration. This gives the proper corrections to the residual: c k € K m {A k -\\ A k -\T k -{). However, the 
corresponding corrections to the approximate solution (contrary to ordinary implementations of Krylov 
methods) are found by u k = A~ l c k G A~ 1 K Tn (A k -i] A k ~ir k -i)- These corrections can be computed since 
the inverse of A is known over this space. Equation (5) gives: 

A~ l A k _i = A- 1 A - A~ 1 C k - 1 C k _ 1 A = I - U^C^A. (6) 

This leads to a variant of the GMRESR iteration scheme, which has an improved performance for many 
problems. 

In this article we will consider GMRES and BICGSTAB as inner methods. In the next section we will 
discuss the implications of the orthogonalization in the inner method. It will be proved that this leads to 
an optimal approximation over the space spanned by both the outer and the inner iteration vectors. It also 
introduces a potential problem: the possibility of breakdown in the generation of the Krylov space in the 
inner iteration, since we iterate with a singular operator. We will show, however, that such a breakdown 
can never happen before a specific (generally large) number of iterations. Furthermore, we will also show 
how to remedy such a breakdown. We will also discuss the efficient implementation of these methods and 
how we can truncate the outer GCR iteration. Outlines of the algorithms can be found in [7], [2]. 


CONSEQUENCES OF INNER ORTHOGONALIZATION 

To keep this section concise, we will only give a short indication of the proofs or omit them completely. 

The proofs can be found in [2j. Throughout the rest of this article we will use the following notations: 
o By U k = [iq . . . u k ] and C k = [c\ . . . c k ] we denote matrices that satisfy the relations (1); 
o By x k and r k we denote the vectors that satisfy the relations (2)-(4); 
o By P k and Q k we denote the projections defined as P k = CkCfj^ and Q k = U k Cj.A\ 
o By A k we denote the operator defined as A k = (I - P k )A\ 

o By V m = \v\ . . . , v m \ we denote the orthonormal matrix generated by m steps of Arnoldi (GMRES) 
with A k and such that V\ = r k /\\r k \\ 2 . 

From this and (6) it then follows that 

AQ k = P k A, and A~ l A k = (/ - Qk)- ( 7 ) 
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We will describe the (fc + l)-th step of our variant of the GMRESR iteration scheme, where in the inner 
GMRES iteration the modified operator Ak is used. We use m (not fixed) steps of the GMRES algorithm 
to compute the correction to r^+i in the space K Tn (Af c ; A^r*). This leads to the optimal correction to the 
approximate solution Xk+i over the ‘global* space range(Uk+i) © A~ x K m {Ak] A^r^) - 


Theorem 1 The Amoldi process in the inner GMRES iteration defines the relation A^Vm = Vm+iHm, 
with H m an ((m + 1) x m) Hessenberg matrix. Let y be defined by 

V ’■ min Ikfc - A k V m y || 2 = .nun \\r k - V^+i H m yh- ( 8 ) 


Then the minimal residual solution of the inner GMRES iteration: {A 1 A k V m y ) gives the outer 
approximation 


Zfc+1 = x k + (J - Qk)V m y, 

(9) 

which is also the solution to the l globaV minimization problem 


Xfc+1 : min \\b - Ax || 2 

(10) 

ir€ran )© 

range(Vm ) 



It also follows from this theorem that the GCR optimization (in the outer iteration) is given by ( 9 ), so that 
the residual computed in the inner GMRES iteration equals the residual of the outer GCR iteration: 

7-fc+i = b — Axk+i = b — Axk — AkVmy = Tk — Ak V m y. From this it follows that in the outer GCR iteration 
the vectors Uk + i and Cfc+i are given by 

Ck+ 1 = (AkV^/WAkVmyWz, (11) 

ttfc+i = ((I-Qk)Vmy)/\\A k V m y\\ 2 . (12) 

Note that ( I — QkYVmV has been computed already as the the approximate solution in the inner GMRES 
iteration; see (9), and AkVmy is easily computed from the relation AkV m y = V m +\ Hy m - Moreover, as a 
result of using GMRES in the inner iteration, the norm of the residual r*+i as well as the norm of AkV m y 
is already known at no extra computational costs. Consequently, the outer GCR iteration becomes very 
simple. 

We will now consider the possibility of breakdown when generating a Krylov space with a singular, 
nonsymmetric operator. Although GMRES is still optimal in the sense that at each iteration it delivers the 
minimum residual solution over the generated Krylov subspace, the generation of the Krylov subspace 
itself, from a singular operator, may terminate too early. The following simple example shows that this 
may happen before the solution is found, even when the solution and the right hand side are both in the 
range of the given (singular) operator and in the orthogonal complement of its null-space. 

Define the matrix A = (e2 e 3 e 4 0 ), where e* denotes the i-th Cartesian basis vector. Note that 
A = (/ — eiej)(e 2 e\ ei), which is the same type of operator as Ak , an orthogonal projection times a 
nonsingular operator. Now consider the system of equations Ax — e$. Then GMRES (or any other Krylov 
method) will search for a solution in the space 

span{ez, Ae 3 , A 2 e 3 , .. .} = span{e^e^ 0,0,...}. 

So we have a breakdown of the Krylov space and the solution is not contained in it. We remark that the 
singular unsymmetric case is quite different from the symmetric one. 
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In the remainder of this section we will prove that a breakdown in the inner GMRES method cannot occur 
before the total number of iterations exceeds the dimension of the Krylov space K{A\ ro). This means that, 
in practice, a breakdown will be rare. Furthermore, we will show how such a breakdown can be overcome. 

We will now define breakdown of the Krylov space for the inner GMRES iteration more formally. 


Definition 1 We say there is a breakdown of the Krylov subspace in the inner GMRES iteration if 
A k v m £ ran< 7 e(V m ) , since this implies we can no longer expand the Krylov subspace. We call it a lucky 
breakdown if vi E range(A k V m ) , because we then have found the solution (the inverse of A is known 
over the space range(A k V m ) ). We call it a true breakdown if v i £ range(A k V m ) , because then the 
solution is not contained in the Krylov subspace . 


The following theorem relates true breakdown to the invariance of the sequence of subspaces in the inner 
method for the operator A k . Part four indicates that it is always known whether a breakdown is true or 
lucky. 

Theorem 2 The following statements are equivalent: 

1. A true breakdown occurs in the inner GMRES iteration at step m; 

2. range{A k V m -\ ) is an invariant subspace of A k ; 

5. A k v m E range(A k Vm-i) ; 

4- A k V m = V m H m , and H m is a singular m x m matrix. 

From theorem 1, one can already conclude that a true breakdown occurs if and only if A k is singular over 
K m (A k ]r k )• From the definition of A k we know null{A k ) = range{U k ) * We will make this more explicit 
in the following theorem, which relates true breakdown to the intersection of the inner search space and the 
outer search space. 


Theorem 3 A true breakdown occurs if and only if 

range(V m ) H range(U k ) ^ {0}. 


The following theorem indicates that no true breakdown in the inner GMRES iteration can occur before 
the total number of iterations exceeds the dimension of the Krylov space K{A\r$). 


Theorem 4 Let m = dim(K(A; ro)) and let l be such that r k = Vi{A)r$ for some polynomial Vi of degree 
l. Then 


dim(K- 7+1 (>l it ; r 0 )) = j + 1 for j + l < m 


and therefore no true breakdown occurs in the first j steps of the inner GMRES iteration. 


We will now show how a true breakdown can be overcome. There are basically two ways to continue: 
In the inner iteration: by finding a suitable vector to expand the Krylov space. 
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In the outer iteration: by computing the solution of the inner iteration just before the true breakdown 
and then by making one LSQR-step (see below) in the outer iteration* 

We will consider the continuation in the inner GMRES iteration first. The following theorem indicates how 
one can continue the generation of the Krylov space K(A\ r*) if in the inner GMRES iteration a true 
breakdown occurs. 

Theorem 5 If a true breakdown occurs in the inner GMRES iteration then 

3c € range(C k ) : A k c & range{A k V m - 1) (13) 

This implies that one can try the vectors c* until one of them works. However, one should realize that the 
minimization problem (8) is slightly more complicated. 

Another way to continue after a true breakdown in the inner GMRES iteration is to compute the inner 
iteration solution just before the breakdown and then apply an LSQR-switch (see below) in the outer GCR 
iteration. The following theorem states the reason why one has to apply an LSQR-switch. 

Theorem 6 Suppose one computes the solution of the inner GMRES iteration just before a true 
breakdown. Then stagnation will occur in the next inner iteration, that is r k+ i X K(A k+ y r fc+ i). This will 
lead to a breakdown of the outer GCR iteration. 

The reason for this stagnation in the inner GMRES iteration is that the new residual 77; + 1 remains in the 
same Krylov space K{A k ; r k ), which contains a u € range(U k ) . So we have to ‘leave’ this Krylov space. 
We can do this using the so-called LSQR-switch, which was introduced in [15], to remedy stagnation in the 
inner GMRES iteration. Just as in the GMRESR method, stagnation in the inner GMRES iteration will 
result in a breakdown in the outer GCR iteration, because the residual cannot be updated. The following 
theorem states that this LSQR-switch actually works. 

Theorem 7 If stagnation occurs in the inner GMRES iteration, that is if 
min^gR- ||r fc+ i - A k V m y\\ 2 = ||r fc+1 || 2 , then one can continue by setting (LSQR-switch) 

c k+ 2 = jA k+1 A T r k+ i and (M) 

u k +2 = lf{I - Qk+i)A T r k +i, (I 5 ) 

where 7 = ||cfc + 2||2+ This leads to 

r k+ 2 = r k+ 1 - (rl +1 c k+2 )c k +2 and (16) 

Xfc+2 = Zfc+l - (rjfc + iCfc + 2)u*+2, ( 17 ) 

which always gives an improved approximation. Therefore, these vectors can be used as the start vectors for 
a new inner GMRES iteration. 


IMPLEMENTATION 


We will now describe how to implement these methods efficiently (see also [2], [7]). First we will discuss the 
outer GCR iteration and then the inner GMRES iteration. The implementation of a method like 
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BICGSTAB in the inner iteration will then be obvious. Instead of the matrices U k and C k we will use in 
the actual implementation the matrices U k ,C k , N k , Z k and the vector d k which are defined below. 


Definition 2 The matrices U k , C k , N k , Z k and the vector d k are defined as follows. 

C k = C k N k , where (18) 

N k = dia 5 (||ci|| 2 - 1 ,||c 2 || 2 - 1 ,...,||c Jt || 2 - 1 ), (19) 

AU k = C k Z k , (20) 

where Z k is assumed to be upper-triangular. Finally dk is defined by the relation 

r k =r 0 -C k d k (21) 

From this the approximate solution x k , corresponding to r k , is implicitly represented as 

x k = x 0 + U k Z k l d k . (22) 


Using this relation Xk can be computed at the end of the complete iteration or before truncation (see next 
section). The implicit representation of Uk saves all the intermediate updates of previous it* to a new 
which is approximately 50% of the computational costs in the outer GCR iteration (see (11) and (12)). 

GMRES as inner iteration. After fc outer GCR iterations we have C/it, C k and r k . Then, in the inner 
GMRES iteration, the orthogonal matrix V m +i is constructed such that V m +i = O and 

AV m = C k B m + V rn ^H m (23) 

B m = NlClAV m (24) 

This algorithm is equivalent to the usual GMRES algorithm, except that the vectors Avi are first 
orthogonalized on C k . Prom (23) and (24) it is obvious that AV m - C k B m = AkV m = V m +i Hm (cf. 
theorem 1). Next we compute y according to (8) and we set (cf. (11) without normalization): 

Cfc+i = Vm+iHmy ( 25 ) 

Ufc+l — VmV‘ ( 2 ®) 

This leads to Au k +i = AV m y = C k B m y + Vm+iHmy = C k Bmy + c k + u so that if we set z k + 1 = (( B rn y) T 1) T 
the relation AU k +i = C k +\Z k +\ is again satisfied. It follows from theorem 1 that the new residual of the 
outer GCR iterations is equal to the final residual of the inner iteration r k +i = r^ ner and is given by 
r k+ i = r k — Cfc+i, so that d k + \ = 1. Obviously the residual norm only needs to be computed once. If we 
replace, in the formula above, the new residual of the outer GCR iteration r*+i by the residual of the inner 
GMRES iteration r^ ner , we see an important relation that holds more generally c k + \ = r k — rj£ ncr . This 
relation is important, since in general (when other Krylov methods are used for the inner iteration) c k +\ or 
c k + 1 cannot be computed from u kJr i, because u k + 1 is not always computed explicitly, nor does a relation 
like (25) always exist. Finally, we need to compute the new coefficient of N k + 1 , H^ 1 in order to satisfy 

the relations in definition 2. 


TRUNCATION 


In practice, since memory space may be limited and since the method becomes increasingly expensive for 
large k (the number of outer search vectors), we want to truncate the set of outer iteration vectors (u$) and 
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(ci) at k = k max , where fc max is some positive integer. Basically, there are two ways to do this: one can 
discard one or more iteration vector(s) (dropping) or one can assemble two or more iteration vectors into 
one single iteration vector (assembly). We will first discuss the strategy for truncation and then its 
implementation. 


A strategy for Truncation . In each outer GCR iteration step the matrices Uk and C * are augmented with 
one extra column. To keep the memory requirement constant, at step k = kmax, it is therefore sufficient to 
diminish the matrices Uk max and Ck max by one column. Prom (22) we have x* = xo + UkZ k l dk : Denote 
= Z k l dk- Consider the sequence of vectors (£*). The components of these vectors £* are the 
coefficients for the updates Ui of the approximate solution x* . These coefficients converge to the limits 
£b) as k increases. Moreover, (£k^) converges faster than (£*^), and (£a/ 2 ^) converges faster than (£a/ 3 ^) 
etc. . Suppose that the sequence {£k^) has converged to within machine precision. EYom then on it 
makes no difference for the computation of x* when we perform the update xo + £^ui. In terms of 
direction vectors this means that the outer direction vector u\ will not reenter as component in the inner 
iteration process. Therefore one might hope that discarding the vector c\ will not spoil the convergence. 
This leads to the idea of dropping the vector ci(= Au\) or of assembling c\ with C 2 into c (say) when 

*(*)- 

where e > 0 is a small constant. The optimal e, which may depend on k, can be determined from 
experiments. When 6 (k) > e we drop c^ max _ 1 or we assemble c * j7iaj _ l and (of course other choices are 

feasible as well, but we will not consider them in this article). With this strategy we hope to avoid 
stagnation by keeping the most relevant part of the subspace range(Ck ) in store as a subspace of 
dimension k — 1. In the next subsections we describe how to implement this strategy and its consequences 
for the matrices C k and Uk . 


*(i) 

Cfc-l $k 


M) 

V k 


< e, 


(27) 


Dropping a vector . Let 1 < j < A: = k max . Dropping the column Cj is easy. We can discard it without 
consequences. So let C k _ x be the matrix without the column cj . Dropping a column from Uk needs 
more work, since x& is computed as Xk ~ xo + UkZ k l dk* Moreover, in order to be able to apply the same 
strategy in the next outer iteration we have to be able to compute x^+i in a similar way. For that purpose, 
assume that x* can be computed as 


x k — x k~ 1 x 0 + U'k - 1 (^fc-l) 




(28) 


where U k _ a and Z k _ Y are matrices such that AU k _ x = C k _ x Z f k _ x (see (20)). These matrices and Z k _^ 
are easily computed by using the j-th row of (20) to eliminate the j- th column of Ck in (20). In order to 
determine Xq and d f k _ x we introduce the matrix Uk — A~ l Ck = UkZ k l . This enables us to write 


Xk = (xo + d^Uj) + Y^ d k )u i and u 3 = - YL z ii u i)/ z ii- 




3 - 1 


(29) 


i=l 


i = 1 


Substituting the equation for Uj into the equation for Xk we can compute Xk from 

i (j) j — 1 k 

Xk = (xo + —Uj) + Yl( d k ] - d k + Y d^Ui. 

z 33 i=l Z 33 i=j+l 


(30) 
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Notice that this equation precisely defines x' 0 and d' k _ : : 

4-i' = 4° ~ d k ] ( z a/ z jj ) for i = 1, . . . , j - 1 and (31) 

4-i' = 4 +1) for i = 1. 

Now we have deallocated two vectors and we compute as in (28). We can continue the algorithm. 

Assembly of two vectors. Let 1 < j < l < k = k max . Again assembling cj and cj is easy. Let 
c = (d k ^Cj + dfj^ci) overwrite the 1-th column of C k . Then, let C' k _^ be this new matrix C k without j - th 
column. Analogous to the above, we wish to compute x * as (28). For the purpose of determining the 
matrices U' k _^ and Z' k _ l , let u = {d k ^Hj + dj^tq) and compute t^ and such that 

ZjmUj + zimUl + t^Uj = t^u, which gives t[ m ^ = z lm (d^ /<$) - z jm and t^ = z lm /<$ . This enables us 
to write u m = for m - 1, . . . ,j - 1 and 

m 

“m = z irnUi + t^U - t^Uj, for 771 = j k. (32) 

tel 

i*3,l 

Substituting Uj = (uj - 2j=i z ijUi)/zjj , to eliminate Uj from (32) we get u m = a z im Ui, for 
m = 1, . . . , j — 1 and 

^ (to) m 

“m + — — Uj = 53 (Zim + + 4 m) « for m = j + 1, . . . , k. (33) 

Z 33 i.i z 33 

This equation determines the matrices U k _ l and Z k _ 1 . In order to determine x' Q and d k _ 1 , note that x* can 
be computed as 

k 

= x 0 + 53 4 l) * + «• ( 34 ) 

t=i 

Therefore Xq is just xq and d' k _ j equals the vector without the j-th element and the 1-th element 
overwritten by 1. Similarly, as before, we have deallocated two vectors from memory. The assembled 
vectors it and c overwrite u; and cj . The locations of iij and Cj can therefore be used in the next step. 
Finally, we remark that these computations can be done with rank one updates. 

NUMERICAL EXPERIMENTS 


We will discuss the results of some numerical experiments, which concern the solution of two dimensional 
convection diffusion problems on regular grids, discretized using a finite volume technique, resulting in a 
pentadiagonal matrix. The system is preconditioned with ILU applied to the scaled system; see [3], [9]. The 
first two problems are used to illustrate and compare the following solvers: 

• (full) GMRES; 

• BICGSTAB; 

• GMRESR(m), where m indicates the number of inner GMRES iterations between the outer iterations; 

• GCRO(m), which is GCR with m adapted GMRES iterations as inner method, using A k \ 

• GMRESRSTAB, which is GMRESR with BICGSTAB as inner method; 
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Figure 3: Convergence history for problem 1 Figure 4: Convergence in time for problem 1 

• and GCROSTAB, which is GCR with the adapted BICGSTAB as inner method, using A k . 

We will compare the convergence of these methods both with respect to the number of matrix vector 
products and with respect to CPU-time on one processor of the Convex 3840. This means e.g. that each 
step of BICGSTAB (and variants) is counted for two matrix vector products. We give both these 
convergence rates because the main trade off between (full) GMRES, the GCR.0 variants and the 
GMRESR variants is less iterations against more dot products and vector updates per iteration. Any gain 
in CPU-time then depends on the relative cost of the matrix vector multiplication and preconditioning 
versus the orthogonalization cost on the one hand and on the difference in iterations on the other hand. We 
will use our third problem to show the effects of truncation and compare two strategies. 

Problem 1. This problem comes from the discretization of 

(Uxx “h u yy) ~b bv-x OUy 0 

on [0, 1] x [0,4], where 

f 100 for 0 < y < 1 and 2 < y < 3 
b{x,y)- < _ 1Q0 f()r i < y < 2 and 3<y<4 

and c ~ 100. The boundary conditions are « = 1 on y = 0, u = 0 on y = 4, u' = 0 on i = 0 and v! = 0 on 
i = l, where u' denotes the (outward) normal derivative. The stepsize in x-direction is 1/100 and in 
y-direction is 1/50. 

In this example we compare the performances of GMRES, GCRO(m) and GMRESR(m), for m = 5 and 
m = 10. The convergence history of problem 1 is given in Fig. 3 and Fig. 4. Fig. 3 shows that GMRES 
converges fastest (in matrix vector products), which is of course to be expected, followed by GCRO(5), 
GMRESR(5), GCRO(IO) and GMRESR(IO). From Fig. 3 we also see that GCRO(m) converges smoother 
and faster than GMRESR(m). Note that GCRO(5) has practically the same convergence behavior as 
GMRES. The vertical ‘steps’ of GMRESR(m) are caused by the optimization in the outer GCR iteration, 
which does not involve a matrix vector multiplication. We also observe that the GMRESR(m) variants 
tend to lose their superlinear convergent behavior, at least during certain stages of the convergence history. 
This seems to be caused by stagnation or slow convergence in the inner GMRES iteration, which (of 
course) essentially behaves like a restarted GMRES. For GCRO(m), however, we see a much smoother and 
faster convergence behavior and the superlinearity of (full) GMRES is preserved. This is explained by the 
‘global’ optimization over both the inner and the outer search vectors (the latter form a sample of the 
entire, previously searched Krylov subspace). So we may view this as a semi-full gmres. Fig. 4 gives the 
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Figure 5: Convergence history for problem 2 Figure 6: Convergence history for BICGSTAB 

variants for problem 2 




Figure 7: Convergence in time for problem 2 


Figure 8: Coefficients for problem 2 


convergence with respect to CPU-time. In this example GCRO(5) is the fastest, which is not surprising in 
view of the fact that it converges almost as fast as GMRES, but against much lower costs. Also, we see 
that GCRO(IO), while slower than GMRESR(5), is still faster than GMRESR(IO). In this case the extra 
orthogonalization costs in GCRO are outweighed by the improved convergence behavior. 

Problem. 2. This problem is taken from [14]. The linear system comes from the discretization of 


~(au x ) x - ( auy) y + bu x = f 


on the unit square, with b = 2 exp 2(x 2 + y 2 ). Along the boundaries we have Dirichlet conditions: u = 1 for 
y — 0, x = 0 and x = 1, and u = 0 for y = 1. The functions a and / are defined as shown in Fig. 8; / = 0 
everywhere, except for the small subsquare in the center where / = 100. The stepsize in x-direction and in 
y-direction is 1/128. 

If Fig. 5 a convergence plot is given for (full) GMRES, GCRO(m) and GMRESR(m). We used m = 10 and 
m = 50 to illustrate the difference in convergence behavior in the inner GMRES iteration of GMRESR(m) 
and GCRO(m). GMRESR(50) stagnates in the inner GMRES iteration whereas GCRO(50) more or less 
displays the same convergence behavior as GCRO(IO) and full GMRES. For the number of matrix vector 
products, it seems that for GMRESR(m) small m are the best choice. 
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In Fig. 6 a convergence plot is given for (full) GMRES, BICGSTAB, and the the BICGSTAB variants, 
GMRESRSTAB and GCROSTAB. To our experience the following strategy gave the best results for the 
BICGSTAB variants: 

• For GMRESRSTAB we ended an inner iteration after either 20 steps or a relative improvement of the 
residual of 0.01; 

• For GCROSTAB we ended an inner iteration after either after 25 steps or a relative improvement of the 
residual of 0.01. 

The convergence of GMRESRSTAB for this example is somewhat typical for GMRESRSTAB in general 
(albeit very bad in this case). This might be explained from the fact that the convergence of BICGSTAB 
depends on a ‘shadow 1 Krylov subspace, which it implicitly generates. Now, if if one restarts, then 
BICGSTAB also starts to build a new, possibly different, ‘shadow’ Krylov subspace. This may lead to 
erratically convergent behavior in the first few steps. Therefore, it may happen that, if in the inner 
iteration BICGSTAB does not converge (to the relative precision), the Solution’ of the inner iteration is 
not very good and therefore the outer iteration may not give much improvement either. At the start the 
same more or less holds for GCROSTAB; however, after a few outer GCR iterations the ‘improved’ 
operator ( Ak ) somehow yields a better convergence than BICGSTAB by itself. This was also observed for 
more tests, although it also may happen that GCROSTAB converges worse than BICGSTAB. 

In Fig. 7 a convergence plot versus the CPU-time is given for GCROSTAB, BICGSTAB, GCRO(IO) and 
GMRESR(IO). The fastest convergence in CPU-time is achieved by GCROSTAB(IO), which is ^ 20% 
faster than BICGSTAB notwithstanding the extra work in orthogonalizations. We also see, that although 
GCRO(IO) takes fewer iterations than GMRESR(IO), in CPU-time the latter is faster. So in this case the 
decrease in iterations does not outweigh the extra work in orthogonalizations. For completeness we mention 
that GMRESRSTAB took almost 15 seconds to converge, whereas GMRES took almost 20 seconds. 

Problem 3. The third problem is taken from [10]. The linear system stems from the discretization of the 
partial differential equation 

-u xx - Uyy + lOOO^u* + yUy) + lOu = / 

on the unit square with zero Dirichlet boundary conditions. The stepsize in both x-direction and 
y-direction is 1/65. The right-hand side is selected once the matrix is constructed so that the solution is 
known to be x — (1, 1, ... , l) r . The zero vector was used as an initial guess. 

In Fig. 9 we see a plot of the convergence history of full GMRES, GMRESR(5), GCRO(5) and 
GCRO(10,5) for two different truncation strategies, where the first parameter gives the dimension of the 
outer search space and the second the dimension of the inner search space. The number of vectors in the 
outer GCR iteration is twice the dimension of the search space. For the truncated version: 

• ‘da’ means that we took c = 10~ 3 and dropped the vectors ii\ and c\ when 6(k) < e and assembled the 
vectors iig and uio as well as the vectors eg and cio when <5 ( k ) > e; 

• ‘tr’ means that we dropped the vectors ug and eg each step (e = 0, see also [16]). 

Notice that GCRO(5) displays almost the same convergence behavior as full GMRES. GMRESR(5) 
converges eventually, but only after a long period of stagnation. The truncated versions of GCRO(5) also 
display stagnation, but for a much shorter period. After that the ‘da’ version seems to converge as 
superlinear, whereas the ‘tr’ version still displays periods of stagnation, most notably at the end. This 
indicates that the ‘da’ version is more capable of keeping most of the ‘convergence history’ than the ‘tr’ 
version. This kind of behavior was seen in more tests: ‘assembled’ truncation strategies seem to work 
better than just discarding one or more iteration vectors. 

In Table 1 we give the number of matrix vector products, the number of memory vectors and the 
CPU-time on a Sun workstation. From this table we see that GCRO(5) is by far the fastest method and 
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Figure 9: Convergence history for problem 3 


uses about half the amount of memory vectors full GMRES and GMRESR(5) use. More interesting is that 
GCRO(10,5) ‘da’ converges in the same time as GMRESR(5), but uses only one third of the memory space. 


CONCLUSIONS 


We have derived from the GMRESR inner-outer iteration schemes a modified set of schemes, which 
preserve the optimality of the outer iteration. This optimality is lost in GMRESR since it essentially uses 
‘restarted’ inner GMRES iterations, which do not take advantage of the outer ‘convergence history’. 
Therefore, GMRESR may loose superlinear convergence behavior, due to stagnation or slow convergence of 
the inner GMRES iterations. 


Method 

Mat-Vec 

Memory Vectors 

CPU-time 

GMRES 

77 

77 

21.3 

GMRESR(5) 

188 

81 

18.5 

GCRO(5) 

83 

39 

9.4 

GCRO(10,5) ‘da’ 

150 

25 

18.3 

GCRO(10,5) *tr’ 

244 

25 

30.3 


Table 1: Number of matrix vector products, number of memory 
vectors and CPU-time in seconds for problem 3 
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In contrast, the GCRO variants exploit the ‘convergence history’ to generate a search space that has no 
components in any of the outer directions in which we have already minimized the error. For GCRO(m) 
this means we minimize the error over both the inner search space and a sample of the entire previously 
searched Krylov subspace (the outer search space), resulting in a semi-full GMRES. This probably leads to 
the smooth convergence (much like GMRES) and the absence of stagnation, which may occur in the inner 
GMRES iteration of GMRESR. Apparently the small subset of Krylov subspace vectors that is kept 
approximates the entire Krylov subspace that is generated, sufficiently well. For both GMRESR(m) and 
GCRO(m) it seems that a small number of inner iterations works well. 

We may also say, that the GCRO variants construct a new (improved) operator (of decreasing rank) after 
each outer GCR iteration. Although there is the possibility of breakdown in the inner method for GCRO, 
this seems to occur rarely as is indicated by theorem 4 (it has never happened in any of our experiments). 

With respect to performance of the discussed methods we see that GCRO(m) (almost) always converges in 
fewer iterations than GMRESR(m). Because GCRO(m) is on average more expensive per iteration, this 
does not always lead to faster convergence in CPU-time. This depends on the relative costs of the matrix 
vector product and preconditioner w.r.t. the cost of the orthogonalizations and the reduction in iterations 
for GCRO(m) relative to GMRESR(m). Our experiments, with a cheap matrix vector product and 
preconditioner, show that already in this case the GCRO variants are very competitive with other solvers. 
However, especially when the matrix vector product and preconditioner are expensive or when not enough 
memory is available for (full) GMRES, GCRO(m) is very attractive, GCRO with BICGSTAB also seems 
to be a useful method, especially when a large number of iterations is necessary or when the available 
memory space is small relative to the problem size. GMRESR with BICGSTAB does not seem to work so 
well, probably because, to our observation, restarting BICGSTAB does not work so well. 

We have derived sophisticated truncation strategies and shown by example that superlinear convergence 
behavior can be maintained. From our experience, the ‘assembled’ version seems to have the most promise. 

Acknowledgements. The authors are grateful to Gerard Sleijpen and Henk van der Vorst for 
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